1. Introduction

Power-law distributions have attracted particular attention for their mathematical
properties and appearances in a wide variety of scientific contexts, from
physical and biological sciences to social and man-made phenomena.

Unlike normally distributed quantities, empirical quantities distributed
according to a power law do not cluster around an average value, and thus we
cannot characterize a power-law distribution through its mean and standard
deviation. However, the fact that some scientific observations cannot be
characterized as simply as other measurements is often a sign of complex underlying
processes that deserve further study [1].

Formally, a discrete or continuous quantity x is distributed according to a 
power-law if its probability distribution is 

$$p(x) \propto x^{-\alpha},$$

where $\alpha$ is called the scaling parameter of the distribution. However, only a few
empirical phenomena show such a probability distribution for all values of $x$.
More often, the power law applies only for values greater than some minimum
value $x_{\min}$, so that only the tail of the distribution follows a power law.




A complete introduction to power-law distributions along with a statistical 
framework for discerning and quantifying power-law behavior in empirical data 
can be found in [1], whereas extensive discussions can be found in [2, 3, 4], and 
references therein. 

In this paper, we limit our attention to discrete power-law distributions. In 
particular, we focus on the comparison of two samples of discrete data to assess 
whether they are drawn from the same power-law distribution. For example, we 
could be interested in assessing whether the number of likes received by posts 
on Facebook and the number of likes received by videos on YouTube follow the 
same power-law distribution, or whether the degrees of the nodes belonging to 
two separate clusters in a small-world network are distributed according to the 
same power-law distribution. To answer such questions we cannot rely on the
Kolmogorov-Smirnov test [5, 6]: since that test is formulated for continuous
distributions, applying it to discrete data yields only approximate p-values
because of the presence of ties.

To overcome this issue, we introduce a statistical hypothesis
test based on the log-likelihood ratio to assess whether two samples of discrete
data are drawn from the same population. Under the null hypothesis, i.e. that the
two samples are drawn from the same power-law distribution, the resulting test
statistic follows the $\chi^2$ distribution with 1 degree of freedom.

This paper is structured as follows. In Section 2 we provide some basic definitions
about discrete power-law distributions. In Section 3 we first introduce and
discuss the proposed statistical test, and then we conclude with two examples
illustrating how the test performs.


2. Definitions 


Discrete power-law distributions. Power-law distributions can be continuous
or discrete. Here we focus on the discrete case, i.e. when the quantity of
interest takes only positive integer values. Let $x$ represent the quantity whose
distribution we are interested in. The probability distribution is


$$p(x) = \Pr(X = x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})},$$

where

$$\zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha}$$


is the generalized or Hurwitz zeta function. It is useful to consider also the 
complementary cumulative distribution function, 


$$P(x) = \Pr(X \geq x) = \frac{\zeta(\alpha, x)}{\zeta(\alpha, x_{\min})},$$

since it allows power-law distributions to be plotted on doubly logarithmic axes,
which emphasizes the upper-tail behavior.
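
As a rough numerical illustration of these definitions, the sketch below evaluates the probability mass function and the complementary cumulative distribution function in plain Python. The function names (`hurwitz_zeta`, `pmf`, `ccdf`) and the truncation parameter `n_terms` are illustrative assumptions, not part of the original text. The Hurwitz zeta function is approximated by a finite sum plus an Euler-Maclaurin tail correction, and $P(x)$ is taken to be $\Pr(X \geq x) = \zeta(\alpha, x)/\zeta(\alpha, x_{\min})$.

```python
import math


def hurwitz_zeta(alpha, q, n_terms=50):
    """Approximate zeta(alpha, q) = sum_{n>=0} (n + q)^(-alpha), for alpha > 1, q > 0.

    The first n_terms terms are summed directly; the remaining tail is
    approximated with the leading Euler-Maclaurin terms.
    """
    head = sum((n + q) ** (-alpha) for n in range(n_terms))
    a = n_terms + q
    tail = (a ** (1 - alpha) / (alpha - 1)      # integral of the tail
            + 0.5 * a ** (-alpha)               # boundary term
            + alpha * a ** (-alpha - 1) / 12)   # first derivative correction
    return head + tail


def pmf(x, alpha, xmin=1):
    """p(x) = x^(-alpha) / zeta(alpha, xmin) for integer x >= xmin."""
    return x ** (-alpha) / hurwitz_zeta(alpha, xmin)


def ccdf(x, alpha, xmin=1):
    """P(x) = Pr(X >= x) = zeta(alpha, x) / zeta(alpha, xmin)."""
    return hurwitz_zeta(alpha, x) / hurwitz_zeta(alpha, xmin)
```

With $\alpha = 2$ and $x_{\min} = 1$, `pmf(1, 2.0, 1)` should be close to $6/\pi^2 \approx 0.608$, because $\zeta(2, 1) = \pi^2/6$.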
Estimating the scaling parameter. The standard method for fitting parametrized
models to observed data is the method of maximum likelihood, which provably
gives accurate parameter estimates in the limit of large sample size [7,
8]. The likelihood function of a power-law distribution given a sample
$x = (x_1, x_2, \ldots, x_n)$ is

$$L(\alpha, x_{\min}) = \prod_{i=1}^{n} \alpha\, x_{\min}^{\alpha}\, \frac{1}{x_i^{\alpha+1}},$$

whereas the log-likelihood function is

$$l(\alpha, x_{\min}) = n \ln \alpha + n \alpha \ln x_{\min} - (\alpha + 1) \sum_{i=1}^{n} \ln x_i.$$


Assuming that the data are drawn from a power-law distribution with $x \geq x_{\min}$,
we can derive a maximum likelihood estimator (MLE) of the scaling parameter
$\alpha$. Although there is no exact closed-form expression for the MLE in the discrete
case, an approximate expression can be derived by treating power-law distributed
integers as continuous reals rounded to the nearest integer (details of the
derivation are given in Appendix B of [1]). The result is

$$\hat{\alpha} \simeq 1 + n \left[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{\min} - \frac{1}{2}} \right]^{-1}.$$
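
A minimal sketch of this approximate estimator follows. It assumes the rounding approximation from Appendix B of [1], i.e. $\hat{\alpha} \simeq 1 + n \left[ \sum_i \ln \left( x_i / (x_{\min} - 1/2) \right) \right]^{-1}$. The function name `approx_alpha_mle` is hypothetical, and observations below $x_{\min}$ are simply discarded, which is an assumption about how the tail is selected.

```python
import math


def approx_alpha_mle(sample, xmin):
    """Approximate discrete MLE of the scaling parameter.

    Only observations x >= xmin enter the estimate; n is the number of
    such observations.
    """
    tail = [x for x in sample if x >= xmin]
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Tiny illustrative input (not data from the paper).
alpha_hat = approx_alpha_mle([6, 7, 9, 12, 30], xmin=6)
```

According to [1], the approximation becomes less accurate for small $x_{\min}$, so for $x_{\min}$ close to 1 a numerical maximization of the exact discrete likelihood may be preferable.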
3. Two samples test for discrete power-law distributions

Statistical hypothesis test. Suppose that we have two samples of discrete
data, $s_1$ and $s_2$, and we want to assess whether the two samples are drawn
from the same power-law distribution. We cannot rely on the Kolmogorov-Smirnov
test: since it is formulated for continuous distributions, its application to
discrete distributions produces only approximate p-values due to the presence
of ties. To overcome this issue, we propose a statistical test based on the
log-likelihood ratio:

$$\Lambda = -2 \times l(H_0 \mid s_1 \cup s_2) + 2 \times \left[ l(H_1 \mid s_1) + l(H_1 \mid s_2) \right],$$

where $l(H_0 \mid s_1 \cup s_2)$, i.e. the null model, is the log-likelihood of the pooled
samples, $s_1 \cup s_2$, whereas $l(H_1 \mid s_1) + l(H_1 \mid s_2)$, i.e. the alternative model, is the
sum of the log-likelihoods of the samples $s_1$ and $s_2$.

Under the null hypothesis, i.e. that the two samples are drawn from the same
power-law distribution, the test statistic $\Lambda$ asymptotically follows a $\chi^2$ distribution
with 1 degree of freedom. Indeed, $df = 2 - 1 = 1$, since under the alternative
hypothesis we need to estimate two parameters, $\hat{\alpha}_{s_1}$ and $\hat{\alpha}_{s_2}$, whereas under the
null model we need to estimate only one parameter, $\hat{\alpha}_{s_1 \cup s_2}$.
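
The procedure can be sketched in plain Python as follows. This is one possible reading, not necessarily the author's implementation. It assumes that each log-likelihood is the discrete one implied by $p(x) = x^{-\alpha}/\zeta(\alpha, x_{\min})$, that $\hat{\alpha}$ is found by numerical maximization (a golden-section search on an assumed bracket $[1.01, 10]$), and that the pooled sample is the concatenation of $s_1$ and $s_2$. All function names are hypothetical.

```python
import math


def hurwitz_zeta(alpha, q, n_terms=50):
    """Approximate zeta(alpha, q) for alpha > 1: finite sum plus Euler-Maclaurin tail."""
    head = sum((n + q) ** (-alpha) for n in range(n_terms))
    a = n_terms + q
    tail = a ** (1 - alpha) / (alpha - 1) + 0.5 * a ** (-alpha) + alpha * a ** (-alpha - 1) / 12
    return head + tail


def fit(sample, xmin=1, lo=1.01, hi=10.0, tol=1e-9):
    """Return (alpha_hat, maximized log-likelihood) for p(x) = x^(-alpha) / zeta(alpha, xmin)."""
    n = len(sample)
    sum_log = sum(math.log(x) for x in sample)

    def loglik(alpha):
        return -n * math.log(hurwitz_zeta(alpha, xmin)) - alpha * sum_log

    # Golden-section search; assumes the maximum lies inside [lo, hi].
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c, d = b - g * (b - a), a + g * (b - a)
        if loglik(c) < loglik(d):
            a = c
        else:
            b = d
    alpha_hat = (a + b) / 2
    return alpha_hat, loglik(alpha_hat)


def two_sample_test(s1, s2, xmin=1):
    """Log-likelihood ratio statistic and its chi-squared (1 df) p-value."""
    _, l0 = fit(list(s1) + list(s2), xmin)   # null model: one shared alpha
    _, l1 = fit(s1, xmin)                    # alternative: separate alphas
    _, l2 = fit(s2, xmin)
    stat = -2 * l0 + 2 * (l1 + l2)
    # For 1 degree of freedom, P(chi2 > t) = erfc(sqrt(t / 2)).
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2))
    return stat, p_value
```

Calling `two_sample_test(s1, s2)` returns the pair (statistic, p-value); a p-value below the chosen significance level leads to rejecting the null hypothesis of a common power-law distribution.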
Example 1: two samples drawn from the same distribution. Figure 1
shows the complementary cumulative distribution function, $P(x)$, of two samples,
$s_1$ ($n_{s_1} = 10^5$) and $s_2$ ($n_{s_2} = 10^5$), that we want to compare in order
to assess whether they are drawn from the same power-law distribution. The
estimates of the scaling parameters are, respectively, $\hat{\alpha}_{s_1} = 1.997852$ and
$\hat{\alpha}_{s_2} = 1.99816$. The test statistic is $\Lambda = 0.006508615$ with p-value $0.9356996 > 0.05$;
thus we fail to reject the null hypothesis, and the data are consistent with the two
samples being drawn from the same power-law distribution. Indeed, these samples were
randomly drawn from a power-law distribution with $x_{\min} = 1$ and $\alpha = 2$.
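
The text does not say how the samples were generated. One possible approach, sketched here under that caveat, is inverse-transform sampling on the complementary cumulative distribution function $P(x) = \zeta(\alpha, x)/\zeta(\alpha, x_{\min})$: draw $u$ uniformly on $(0, 1]$ and return the integer $x$ with $P(x) \geq u > P(x+1)$, located by doubling followed by bisection. The function names and the zeta approximation are assumptions made for illustration.

```python
import math
import random


def hurwitz_zeta(alpha, q, n_terms=50):
    """Approximate zeta(alpha, q) for alpha > 1: finite sum plus Euler-Maclaurin tail."""
    head = sum((n + q) ** (-alpha) for n in range(n_terms))
    a = n_terms + q
    tail = a ** (1 - alpha) / (alpha - 1) + 0.5 * a ** (-alpha) + alpha * a ** (-alpha - 1) / 12
    return head + tail


def sample_discrete_power_law(n, alpha, xmin=1, rng=random):
    """Draw n integers >= xmin with Pr(X = x) proportional to x^(-alpha)."""
    z_min = hurwitz_zeta(alpha, xmin)

    def ccdf(x):
        return hurwitz_zeta(alpha, x) / z_min

    out = []
    for _ in range(n):
        u = 1.0 - rng.random()          # uniform on (0, 1]
        lo, hi = xmin, 2 * xmin         # invariant: ccdf(lo) >= u
        while ccdf(hi) >= u:            # doubling until ccdf(hi) < u
            lo, hi = hi, 2 * hi
        while hi - lo > 1:              # bisection on the integers
            mid = (lo + hi) // 2
            if ccdf(mid) >= u:
                lo = mid
            else:
                hi = mid
        out.append(lo)                  # ccdf(lo) >= u > ccdf(lo + 1)
    return out


# Illustrative draw with a fixed seed (sample size chosen for speed, not 10^5).
xs = sample_discrete_power_law(2000, alpha=2.0, xmin=1, rng=random.Random(0))
```

With $\alpha = 2$ and $x_{\min} = 1$, roughly 61% of the draws should equal 1, since $p(1) = 6/\pi^2$.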



Fig 1. Two samples drawn from the same distribution (horizontal axis: $x$). Two samples of discrete data, $s_1$
($n_{s_1} = 10^5$) and $s_2$ ($n_{s_2} = 10^5$), are compared. The estimates of the scaling parameters are,
respectively, $\hat{\alpha}_{s_1} = 1.997852$ and $\hat{\alpha}_{s_2} = 1.99816$. The test statistic is $\Lambda = 0.006508615$ with
p-value $0.9356996 > 0.05$; thus we fail to reject the null hypothesis, and the data are consistent with the two
samples being drawn from the same power-law distribution.


Example 2: two samples drawn from different distributions. Figure 2
shows the complementary cumulative distribution function, $P(x)$, of two samples,
$s_1$ ($n_{s_1} = 10^5$) and $s_2$ ($n_{s_2} = 10^5$), that we want to compare in order
to assess whether they are drawn from the same power-law distribution.
The estimates of the scaling parameters are, respectively, $\hat{\alpha}_{s_1} = 2.003242$ and
$\hat{\alpha}_{s_2} = 2.051032$. The test statistic is $\Lambda = 149.4912$ with p-value $< 10^{-3}$; thus we
reject the null hypothesis and conclude that the two samples are not drawn from
the same power-law distribution. Indeed, these samples were randomly drawn
from two power-law distributions with different scaling parameters, respectively
$\alpha_1 = 2$ and $\alpha_2 = 2 + \delta$, with $\delta = 0.05$.
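
The p-values quoted in the two examples can be checked from the reported statistics alone. For 1 degree of freedom, the $\chi^2$ survival function reduces to $\Pr(\chi^2_1 > t) = \mathrm{erfc}(\sqrt{t/2})$, which needs only the standard library. The helper name `chi2_sf_1df` is an illustrative choice.

```python
import math


def chi2_sf_1df(stat):
    """Survival function of the chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


p_example_1 = chi2_sf_1df(0.006508615)   # statistic reported in Example 1
p_example_2 = chi2_sf_1df(149.4912)      # statistic reported in Example 2

print(p_example_1)
print(p_example_2 < 1e-3)
```

The first value should come out close to the reported p-value of 0.9357, and the second is far below $10^{-3}$, in line with the conclusions drawn in the two examples.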



Fig 2. Two samples drawn from different distributions (horizontal axis: $x$). Two samples of discrete data, $s_1$
($n_{s_1} = 10^5$) and $s_2$ ($n_{s_2} = 10^5$), are compared. The estimates of the scaling parameters are,
respectively, $\hat{\alpha}_{s_1} = 2.003242$ and $\hat{\alpha}_{s_2} = 2.051032$. The test statistic is $\Lambda = 149.4912$ with
p-value $< 10^{-3}$; thus we reject the null hypothesis and conclude that the two samples are not
drawn from the same power-law distribution.